 training recurrent neural network


On the Convergence Rate of Training Recurrent Neural Networks

Neural Information Processing Systems

How can local-search methods such as stochastic gradient descent (SGD) avoid bad local minima in training multi-layer neural networks? Why can they fit random labels even given non-convex and non-smooth architectures? Most existing theory only covers networks with one hidden layer, so can we go deeper? In this paper, we focus on recurrent neural networks (RNNs), which are multi-layer networks widely used in natural language processing. They are harder to analyze than feedforward neural networks, because the \emph{same} recurrent unit is repeatedly applied across the entire time horizon of length $L$, which is analogous to feedforward networks of depth $L$. We show that when the number of neurons is sufficiently large, meaning polynomial in the training data size and in $L$, SGD is capable of minimizing the regression loss at a linear convergence rate. This gives theoretical evidence of how RNNs can memorize data. More importantly, in this paper we build general toolkits to analyze multi-layer networks with ReLU activations. For instance, we prove why ReLU activations can prevent exponential gradient explosion or vanishing, and build a perturbation theory to analyze the first-order approximation of multi-layer networks.


Reviews: On the Convergence Rate of Training Recurrent Neural Networks

Neural Information Processing Systems

This paper shows that GD/SGD can minimize the training loss of RNNs at a linear convergence rate, assuming the hidden layer width is sufficiently large (polynomial in the data size and the time horizon length). To prove this, the authors show that within a small region around the initialization, the squared norm of the gradient can be lower bounded by the function value (Theorem 3). The authors further show that the loss function is somewhat smooth (Theorem 4), which guarantees that moving in the negative gradient direction can decrease the function value. This paper builds new techniques to analyze multi-layer ReLU networks. It also shows that, with appropriate initialization, ReLU activations avoid exponential gradient explosion and exponential gradient vanishing.
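
The way a gradient lower bound and a smoothness bound combine into a linear rate can be sketched as follows. This is a simplified illustration, not the paper's exact statement: the constants $\mu$ and $\beta$ are placeholders, ordinary smoothness stands in for the "somewhat smooth" property of Theorem 4, and the iterates $W_t$ are assumed to stay inside the region around the initialization where both bounds hold.

```latex
% Assumed (placeholder) forms of the two ingredients:
%   gradient lower bound:  \|\nabla f(W)\|_F^2 \ge \mu \, f(W)
%   smoothness:            f(W - \eta \nabla f(W)) \le f(W) - \eta \|\nabla f(W)\|_F^2
%                                                   + \tfrac{\beta \eta^2}{2} \|\nabla f(W)\|_F^2
\begin{align*}
f(W_{t+1})
  &\le f(W_t) - \eta\Bigl(1 - \tfrac{\beta\eta}{2}\Bigr)\,\|\nabla f(W_t)\|_F^2 \\
  &\le f(W_t) - \tfrac{\eta}{2}\,\|\nabla f(W_t)\|_F^2
      && \text{for } \eta \le 1/\beta \\
  &\le \Bigl(1 - \tfrac{\eta\mu}{2}\Bigr)\, f(W_t)
      && \text{by the gradient lower bound,}
\end{align*}
so that $f(W_T) \le \bigl(1 - \tfrac{\eta\mu}{2}\bigr)^{T} f(W_0)$.
```

Under these assumptions the loss shrinks by a constant factor per step, which is what a linear convergence rate means here.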


Reviews: On the Convergence Rate of Training Recurrent Neural Networks

Neural Information Processing Systems

This paper proves poly-time convergence of SGD/GD in over-parametrized RNNs for the first time. Given that there are not many theoretical results in this space, all reviewers find this result to be significant progress.


Exploring Flip Flop memories and beyond: training recurrent neural networks with key insights

arXiv.org Artificial Intelligence

Training neural networks to perform different tasks is relevant across various disciplines. In particular, Recurrent Neural Networks (RNNs) are of great interest in Computational Neuroscience. Open-source frameworks dedicated to Machine Learning, such as TensorFlow and Keras, have produced significant changes in the development of technologies that we currently use. This work aims to make a significant contribution by comprehensively investigating and describing the implementation of a temporal processing task, specifically a 3-bit Flip Flop memory. We delve into the entire modelling process, encompassing equations, task parametrization, and software development. The obtained networks are meticulously analyzed to elucidate dynamics, aided by an array of visualization and analysis tools. Moreover, the provided code is versatile enough to facilitate the modelling of diverse tasks and systems. Furthermore, we present how memory states can be efficiently stored in the vertices of a cube in the dimensionally reduced space, supplementing previous results with a distinct approach.
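
To make the 3-bit Flip Flop task concrete, here is a minimal sketch of how input/target pairs for it might be generated. It is not the authors' code: the function name `make_flip_flop_batch`, the pulse probability, and the ±1 pulse amplitudes are all assumptions chosen for illustration. Each of the three channels receives sparse pulses, and the target for a channel holds the sign of its most recent pulse.

```python
import numpy as np

def make_flip_flop_batch(n_steps=100, n_bits=3, pulse_prob=0.05, seed=0):
    """Generate one trial of an n-bit flip-flop task (hypothetical parametrization)."""
    rng = np.random.default_rng(seed)
    # Sparse input pulses: 0 most of the time, otherwise +1 or -1.
    pulses = rng.choice(
        [-1.0, 0.0, 1.0],
        size=(n_steps, n_bits),
        p=[pulse_prob / 2, 1 - pulse_prob, pulse_prob / 2],
    )
    targets = np.zeros_like(pulses)
    state = np.zeros(n_bits)  # memory is empty (0) until the first pulse arrives
    for t in range(n_steps):
        # A nonzero pulse overwrites the stored bit; a zero input keeps it.
        state = np.where(pulses[t] != 0, pulses[t], state)
        targets[t] = state
    return pulses, targets

inputs, targets = make_flip_flop_batch()
print(inputs.shape, targets.shape)
```

Once every channel has received a pulse, the target vector takes one of 2^3 = 8 sign combinations, which is consistent with the abstract's picture of memory states sitting at the vertices of a cube.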


[R] Training Recurrent Neural Networks as a Constraint Satisfaction Problem • r/MachineLearning

@machinelearnbot

Obviously not the paper author, but this looks quite interesting. Mostly the fact that it finds all the local minima and can thus select the global minimum from them is nice. Though it would have been nice to see what the tradeoff is in terms of computational space and time complexity compared to error backpropagation.


What is Teacher Forcing for Recurrent Neural Networks? - Machine Learning Mastery

#artificialintelligence

Teacher forcing is a method for quickly and efficiently training recurrent neural network models that use the output from a prior time step as input. It is a network training method critical to the development of deep learning language models used in machine translation, text summarization, and image captioning, among many other applications. In this post, you will discover teacher forcing as a method for training recurrent neural networks.
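
The idea can be sketched in a few lines. This is an illustrative toy, not code from the post: `decode` and `step_fn` are hypothetical names, and `step_fn(h, x)` stands in for any recurrent cell that returns a new hidden state and a prediction. With teacher forcing, the input at each step is the ground-truth output of the previous step; without it, the model is fed its own previous prediction.

```python
def decode(step_fn, h0, start_input, targets, teacher_forcing=True):
    """Run a recurrent decoder over a target sequence (hypothetical helper)."""
    h, prev = h0, start_input
    outputs = []
    for y_true in targets:
        h, y_pred = step_fn(h, prev)
        outputs.append(y_pred)
        # Teacher forcing: feed the known target, not the model's own guess.
        prev = y_true if teacher_forcing else y_pred
    return outputs

# Toy "cell" that just adds 1 to its input, to make the data flow visible.
toy_step = lambda h, x: (h, x + 1)
print(decode(toy_step, 0, 0, [10, 20, 30], teacher_forcing=True))
print(decode(toy_step, 0, 0, [10, 20, 30], teacher_forcing=False))
```

In the teacher-forced run each step sees the previous target, so one early mistake does not propagate through the rest of the sequence during training.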


A Gentle Introduction to Exploding Gradients in Neural Networks - Machine Learning Mastery

#artificialintelligence

Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training. This makes your model unstable and unable to learn from your training data. In this post, you will discover the problem of exploding gradients with deep artificial neural networks. An error gradient is the direction and magnitude calculated during the training of a neural network that is used to update the network weights in the right direction and by the right amount.
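
A small numerical sketch shows how gradients can accumulate across time steps. It is a deliberate simplification and not taken from the post: the recurrence is linear, h_t = W h_{t-1}, with no activation function, and the weight matrix is assumed to be a scaled orthogonal matrix so that the growth rate is exact. The helper names `scaled_orthogonal` and `backprop_norms` are hypothetical.

```python
import numpy as np

def scaled_orthogonal(n, scale, seed=0):
    """Return scale * Q, where Q is a random orthogonal n x n matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return scale * q

def backprop_norms(W, n_steps):
    """Norm of the backpropagated gradient after each step of h_t = W h_{t-1}."""
    grad = np.ones(W.shape[0]) / np.sqrt(W.shape[0])  # unit-norm starting gradient
    norms = []
    for _ in range(n_steps):
        grad = W.T @ grad  # chain rule through one time step
        norms.append(np.linalg.norm(grad))
    return norms

for scale in (0.5, 1.0, 1.5):
    norms = backprop_norms(scaled_orthogonal(16, scale), 20)
    print(f"scale={scale}: gradient norm after 20 steps = {norms[-1]:.3e}")
```

Because an orthogonal matrix preserves norms, the gradient norm after t steps is exactly scale**t here: it explodes for a scale above 1 and vanishes for a scale below 1.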